
Introduction to Semi-Supervised Learning

Semi-supervised learning is a machine learning paradigm that falls between supervised and unsupervised learning. It trains on a small amount of labeled data combined with a large amount of unlabeled data, using the unlabeled data to improve on what the labels alone can teach.

The Need for Semi-Supervised Learning

In many real-world scenarios, obtaining labeled data is:

  • Expensive: Requires human experts to annotate
  • Time-consuming: Manual labeling can take significant time
  • Sometimes impractical: Some domains impose inherent constraints, such as privacy restrictions or a shortage of qualified annotators

Meanwhile, unlabeled data is typically:

  • Abundant: Can be collected automatically
  • Inexpensive: No human annotation required
  • Informative: Reveals the underlying data distribution

Semi-supervised learning bridges this gap by leveraging both types of data.

Core Assumptions

Semi-supervised learning relies on specific assumptions about the relationship between data distribution and the target function:

1. Smoothness Assumption

  • Points that are close to each other in a high-density region are likely to share a label
  • Consequently, the decision boundary should pass through low-density regions

2. Cluster Assumption

  • Data points tend to form distinct clusters
  • Points in the same cluster are likely to have the same label

3. Manifold Assumption

  • High-dimensional data often lies on or near a low-dimensional manifold
  • Learning the manifold structure from unlabeled data helps classification

Types of Semi-Supervised Learning

Inductive Semi-Supervised Learning

  • Goal: Learn a function that can predict labels for unseen data
  • Uses labeled and unlabeled data during training
  • Once trained, can make predictions without unlabeled data

Transductive Semi-Supervised Learning

  • Goal: Predict labels for specific unlabeled examples used during training
  • No generalization to new, unseen data points
  • Example: Graph-based methods that propagate labels directly

Common Approaches

Self-Training (Pseudo-Labeling)

  • Train a model on labeled data
  • Use the model to predict labels for unlabeled data
  • Add high-confidence predictions to the labeled dataset
  • Retrain the model on the expanded labeled set and repeat (see the sketch below)
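
A minimal sketch of this loop, assuming scikit-learn and NumPy are available; the 0.9 confidence threshold and round limit are illustrative choices, and scikit-learn's SelfTrainingClassifier packages the same idea as a ready-made estimator:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_lab, y_lab, X_unlab, threshold=0.9, max_rounds=10):
    """Iteratively pseudo-label unlabeled points the model is confident about."""
    X_lab, y_lab = X_lab.copy(), y_lab.copy()
    model = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
    for _ in range(max_rounds):
        if len(X_unlab) == 0:
            break
        probs = model.predict_proba(X_unlab)
        confident = probs.max(axis=1) >= threshold   # high-confidence points only
        if not confident.any():
            break                                    # nothing left to add; stop early
        pseudo = model.classes_[probs[confident].argmax(axis=1)]
        X_lab = np.vstack([X_lab, X_unlab[confident]])
        y_lab = np.concatenate([y_lab, pseudo])
        X_unlab = X_unlab[~confident]
        model = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)  # retrain
    return model
```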

Co-Training

  • Train multiple models on different views/features of the data
  • Each model labels unlabeled data for the other models
  • Requires naturally occurring distinct views or an artificial feature split (see the sketch below)
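
A compact sketch of one co-training round, assuming the two views come from splitting the feature columns at an index `split`; the choice of base classifier and the number of points labeled per round (`k`) are illustrative:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def co_train_round(X_lab, y_lab, X_unlab, split, k=5):
    """One co-training round: each view's model labels its k most confident
    unlabeled points, which then go into the shared labeled pool."""
    views = [slice(None, split), slice(split, None)]   # two column ranges
    models = [GaussianNB().fit(X_lab[:, v], y_lab) for v in views]
    for i, v in enumerate(views):
        if len(X_unlab) == 0:
            break
        probs = models[i].predict_proba(X_unlab[:, v])
        top = np.argsort(probs.max(axis=1))[-k:]       # k most confident points
        pseudo = models[i].classes_[probs[top].argmax(axis=1)]
        # Move the newly labeled points into the labeled pool ...
        X_lab = np.vstack([X_lab, X_unlab[top]])
        y_lab = np.concatenate([y_lab, pseudo])
        # ... and out of the unlabeled pool.
        X_unlab = np.delete(X_unlab, top, axis=0)
        # Retrain the *other* model on the grown labeled set.
        j = 1 - i
        models[j] = GaussianNB().fit(X_lab[:, views[j]], y_lab)
    return models, X_lab, y_lab, X_unlab
```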

Generative Models

  • Model the joint distribution of data and labels
  • Use labeled data to learn conditional distributions
  • Use unlabeled data to better estimate the overall data distribution (see the sketch below)
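
One simple instantiation, offered only as an illustration: fit a Gaussian mixture with one component per class, initialize the components from the labeled class means, let EM on all data (labeled plus unlabeled) refine the density estimate, then map components back to classes via the labeled points:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_generative_ssl(X_lab, y_lab, X_unlab, n_classes):
    """Gaussian mixture with one component per class; assumes labels 0..n_classes-1."""
    # Initialize each component at its class mean, estimated from labeled data.
    means_init = np.vstack([X_lab[y_lab == c].mean(axis=0)
                            for c in range(n_classes)])
    gmm = GaussianMixture(n_components=n_classes, means_init=means_init,
                          random_state=0)
    # EM on ALL points: the unlabeled data sharpens the density estimate.
    gmm.fit(np.vstack([X_lab, X_unlab]))
    # Components can drift during EM, so re-anchor each one to a class by
    # majority vote of the labeled points it claims.
    comp = gmm.predict(X_lab)
    comp_to_class = np.array([
        np.bincount(y_lab[comp == k], minlength=n_classes).argmax()
        if (comp == k).any() else k
        for k in range(n_classes)])
    return gmm, comp_to_class

# Prediction for new points: comp_to_class[gmm.predict(X_new)]
```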

Graph-Based Methods

  • Construct a graph where nodes are data points
  • Connect similar instances with weighted edges
  • Propagate labels from labeled to unlabeled nodes along the graph structure (see the sketch below)
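
scikit-learn implements this family directly; the sketch below runs LabelSpreading on a toy two-moons dataset, with -1 marking unlabeled points per sklearn's convention (the dataset and the roughly 5% label fraction are illustrative):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import LabelSpreading

X, y = make_moons(n_samples=300, noise=0.1, random_state=0)
y_train = np.copy(y)
# Hide most labels: -1 marks a point as unlabeled for sklearn.
rng = np.random.RandomState(0)
unlabeled = rng.rand(len(y)) > 0.05   # keep roughly 5% of labels
y_train[unlabeled] = -1

model = LabelSpreading(kernel="rbf", gamma=20)
model.fit(X, y_train)
print("accuracy on originally unlabeled points:",
      (model.transduction_[unlabeled] == y[unlabeled]).mean())
```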

Semi-Supervised Support Vector Machines (S3VM)

  • Extend traditional SVMs to include unlabeled data
  • Find a decision boundary that separates the labeled data while passing through low-density regions of the unlabeled data (a simplified sketch follows)
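
Exact S3VM optimization is non-convex and usually handled by specialized solvers; the sketch below is only a crude self-labeling approximation in that spirit, alternating between guessing labels for the unlabeled points and refitting with those points down-weighted (the weight and iteration count are arbitrary choices):

```python
import numpy as np
from sklearn.svm import SVC

def s3vm_like(X_lab, y_lab, X_unlab, n_iter=10, unlab_weight=0.5):
    """Crude S3VM-style heuristic, not the exact S3VM objective."""
    clf = SVC(kernel="linear").fit(X_lab, y_lab)
    for _ in range(n_iter):
        y_unlab = clf.predict(X_unlab)        # current guess for unlabeled labels
        X_all = np.vstack([X_lab, X_unlab])
        y_all = np.concatenate([y_lab, y_unlab])
        # Down-weight the unlabeled points so guesses count less than labels.
        w = np.concatenate([np.ones(len(y_lab)),
                            np.full(len(y_unlab), unlab_weight)])
        clf = SVC(kernel="linear").fit(X_all, y_all, sample_weight=w)
    return clf
```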

Performance Considerations

When Semi-Supervised Learning Works Well

  • When assumptions hold true for the data
  • When labeled data is scarce but high quality
  • When unlabeled data provides useful structure information

When It Can Fail

  • When assumptions are violated
  • When labeled data is too scarce to bootstrap learning
  • When incorrect pseudo-labels are reinforced during training, causing errors to propagate (confirmation bias)

Applications

  • Text Classification: Using small sets of labeled documents with large unlabeled corpora
  • Image Recognition: Leveraging abundant unlabeled images with few labeled examples
  • Medical Diagnosis: Using limited diagnosed cases with many undiagnosed medical records
  • Speech Recognition: Combining transcribed and untranscribed audio samples
  • Protein Structure Prediction: Using known structures to help predict unknown ones
  • Web Content Classification: Categorizing web pages with limited manual annotations

Evaluation

Evaluating semi-supervised learning methods requires careful consideration:

  • Hold out labeled data for testing
  • Compare against supervised learning trained on the labeled data alone
  • Compare against two-step unsupervised-then-supervised pipelines
  • Measure performance as a function of the labeled/unlabeled ratio (see the sketch below)
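
A sketch of the last point: sweep the fraction of retained labels and compare a purely supervised baseline against a semi-supervised model on a held-out test set (the two-moons data, model choices, and label fractions are all illustrative):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.semi_supervised import LabelSpreading

X, y = make_moons(n_samples=1000, noise=0.15, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.RandomState(0)
for frac in [0.02, 0.05, 0.10, 0.20]:
    labeled = rng.rand(len(y_tr)) < frac        # keep this fraction of labels
    # Supervised baseline: labeled points only.
    sup = LogisticRegression().fit(X_tr[labeled], y_tr[labeled])
    # Semi-supervised: all points, with -1 marking the unlabeled ones.
    y_semi = np.where(labeled, y_tr, -1)
    ssl = LabelSpreading(gamma=20).fit(X_tr, y_semi)
    print(f"{frac:.0%} labeled: supervised={sup.score(X_te, y_te):.3f} "
          f"semi-supervised={ssl.score(X_te, y_te):.3f}")
```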

Recent Advances

  • MixMatch: Unifies consistency regularization, entropy minimization (via label sharpening), and MixUp augmentation
  • FixMatch: Combines pseudo-labeling with weak/strong augmentation consistency and a fixed confidence threshold (its unlabeled loss is sketched below)
  • UDA (Unsupervised Data Augmentation): Uses strong data augmentation for consistency regularization
  • Mean Teacher: Maintains an exponential moving average of the model's weights as a teacher that provides consistency targets
  • Virtual Adversarial Training: Adds adversarial perturbations to inputs and enforces prediction consistency
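
As an illustration of the consistency-plus-thresholding recipe, here is a minimal PyTorch sketch of FixMatch's unlabeled loss term (the 0.95 threshold matches the paper's default; the model, augmentations, and the rest of the training loop are omitted):

```python
import torch
import torch.nn.functional as F

def fixmatch_unlabeled_loss(logits_weak, logits_strong, threshold=0.95):
    # Pseudo-label from the weakly augmented view; no gradient flows through it.
    with torch.no_grad():
        probs = F.softmax(logits_weak, dim=-1)
        conf, pseudo = probs.max(dim=-1)
        mask = (conf >= threshold).float()   # keep only confident predictions
    # Cross-entropy on the strongly augmented view, masked by confidence.
    loss = F.cross_entropy(logits_strong, pseudo, reduction="none")
    return (loss * mask).mean()
```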

By effectively leveraging both labeled and unlabeled data, semi-supervised learning offers a powerful approach for many real-world problems where labeled data is limited but unlabeled data is plentiful.